Chapter 5 · Deep Learning Fundamentals

Weight Initialization

Why the starting values of your neural network's weights decide if it learns at all — and how to get them right.

Why Starting Values Matter

Before a neural network learns anything, every weight must be given an initial value. This isn't just a technicality — it's the difference between a network that converges in minutes and one that never learns at all.

Think of it like tuning a guitar.

If every string starts wildly out of tune, the musician (optimizer) has to make huge corrections that may overshoot. If every string starts at the exact same note, you can't play a chord — you need variety. The ideal: each string starts close to its correct pitch, with just enough variation.

The golden rule: keep activation variance ≈ constant across every layer.

If variance shrinks layer by layer, gradients vanish (network stops learning). If variance grows, gradients explode (network becomes unstable). Good initialization keeps the signal stable as it flows forward and backward.

🪞

Symmetry

All weights identical → all neurons compute the same thing → network collapses to 1 neuron per layer.
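The symmetry argument can be checked by hand on a tiny network. The sketch below is a minimal illustration, not a training loop: the one-hidden-layer tanh network, the helper name `forward_backward`, and the constant 0.3 are all assumptions chosen for the example. It gives every hidden unit the same weights and compares the gradients they receive.

```python
import math

def forward_backward(x, W1, w2, target):
    """One forward/backward pass through a 1-hidden-layer tanh network.

    W1 holds one row of input weights per hidden unit; w2 holds the
    hidden-to-output weights. Returns d(loss)/d(W1), one row per hidden unit,
    for loss = (y - target)^2.
    """
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sum(w * hj for w, hj in zip(w2, h))
    dy = 2.0 * (y - target)                      # d(loss)/dy
    grads = []
    for hj, w2j in zip(h, w2):
        dz = dy * w2j * (1.0 - hj * hj)          # back through tanh
        grads.append([dz * xi for xi in x])      # gradient for this unit's weights
    return grads

x = [0.5, -1.0, 2.0]
W1_same = [[0.3, 0.3, 0.3] for _ in range(4)]    # four identical hidden units
w2_same = [0.3] * 4
grads = forward_backward(x, W1_same, w2_same, target=1.0)
all_equal = all(g == grads[0] for g in grads)
print(all_equal)  # True: every hidden unit gets the same update
```

Because the updates are identical, the four units stay identical after every gradient step, which is why random values are needed to break the tie.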

📉

Vanishing Gradients

Weights too small → activations shrink exponentially through layers → gradients shrink with them, so layers far from the loss barely update.

💥

Exploding Gradients

Weights too large → activations blow up → loss becomes NaN, training crashes.

How Signals Flow Through Layers

Watch what happens to the activation distribution as a signal passes through a 10-layer network with different init schemes. Each bar shows the spread (variance) of activations at that layer.

[Interactive chart: Activation Variance Across Layers. See how different initializations affect signal propagation.]
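A rough stand-in for the chart can be run offline. The sketch below assumes a fully connected ReLU network of width 64 with Gaussian inputs and tracks the variance of the pre-activations at each of 10 layers. The function name `layer_variances`, the width, the batch size, and the "too large" standard deviation of 0.5 are illustrative choices, not values from the text.

```python
import math
import random
import statistics

def layer_variances(weight_std, width=64, depth=10, batch=16, seed=0):
    """Pre-activation variance at each layer of a random ReLU MLP."""
    rng = random.Random(seed)
    # Unit-variance Gaussian inputs, one list per example.
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    variances = []
    for _ in range(depth):
        W = [[rng.gauss(0.0, weight_std) for _ in range(width)] for _ in range(width)]
        pre = [[sum(w * a for w, a in zip(row, x)) for row in W] for x in acts]
        variances.append(statistics.pvariance([z for x in pre for z in x]))
        acts = [[max(0.0, z) for z in x] for x in pre]   # ReLU
    return variances

width = 64
schemes = {
    "small": 0.01,                                   # small random, std = 0.01
    "xavier": math.sqrt(2.0 / (width + width)),      # Xavier with n_in = n_out
    "he": math.sqrt(2.0 / width),                    # He
    "large": 0.5,                                    # deliberately too large
}
results = {name: layer_variances(std, width=width) for name, std in schemes.items()}
for name, variances in results.items():
    print(f"{name:8s} layer 1: {variances[0]:.3g}   layer 10: {variances[-1]:.3g}")
```

Under these assumptions the small-random run collapses toward zero, the too-large run grows by orders of magnitude, Xavier with ReLU roughly halves at every layer, and He stays within the same order of magnitude from layer 1 to layer 10.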

A Brief History

Pre-2010 — The Dark Ages

Small Random

Weights drawn from N(0, 0.01), i.e. a standard deviation of 0.01. Worked for 2–3 layer nets. Deeper? Gradients vanished: with typical layer widths, activations shrink toward zero within about five layers.
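The shrinkage can be estimated with one line of arithmetic. For a purely linear layer with zero-mean weights, the variance is multiplied by fan_in × σ² at every layer. The sketch below assumes a hypothetical width of 256 and reads 0.01 as the standard deviation. Nonlinearities are ignored, so treat the numbers as an order-of-magnitude guide only.

```python
def variance_after(depth, fan_in, weight_std, input_var=1.0):
    """Activation variance after `depth` linear layers (no nonlinearity)."""
    gain_per_layer = fan_in * weight_std ** 2
    return input_var * gain_per_layer ** depth

for depth in (1, 3, 5):
    print(depth, variance_after(depth, fan_in=256, weight_std=0.01))
```

With these numbers each layer multiplies the variance by 0.0256, so after five layers it is on the order of 1e-8.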

2010 — Glorot & Bengio

Xavier / Glorot Init

Insight: balance the variance for both forward and backward passes. For a layer with n_in inputs and n_out outputs, average them:

Var(W) = 2 / (n_in + n_out)

Designed for sigmoid / tanh-style activations that are approximately linear near zero. It breaks down for ReLU, which zeros out roughly half of its inputs, so the signal shrinks layer by layer.
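As a quick numeric check of the Xavier formula, the sketch below computes the target standard deviation and the matching bound for the uniform variant, using the fact that U(-a, a) has variance a²/3. The helper names and the 300 × 100 layer shape are made up for illustration.

```python
import math
import random
import statistics

def xavier_std(n_in, n_out):
    """Standard deviation for Xavier normal: sqrt(2 / (n_in + n_out))."""
    return math.sqrt(2.0 / (n_in + n_out))

def xavier_uniform_limit(n_in, n_out):
    """Bound a for U(-a, a) with the same variance: a = sqrt(6 / (n_in + n_out))."""
    return math.sqrt(6.0 / (n_in + n_out))

n_in, n_out = 300, 100
rng = random.Random(0)
a = xavier_uniform_limit(n_in, n_out)
samples = [rng.uniform(-a, a) for _ in range(n_in * n_out)]
target_var = xavier_std(n_in, n_out) ** 2        # 2 / 400 = 0.005
empirical_var = statistics.pvariance(samples)    # should land close to the target
print(target_var, empirical_var)
```

The normal and uniform variants are interchangeable as far as the variance argument goes; they differ only in the shape of the distribution.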

2015 — He et al.

He / Kaiming Init

ReLU zeros out ~50% of values, so variance drops by half each layer. Fix: double the variance.

Var(W) = 2 / n_in

For a square layer (n_in = n_out), that factor of 2 is the entire difference from Xavier. It helped make very deep ReLU networks, ResNet-50 and beyond, trainable.
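For convolutional layers, n_in is not just the number of input channels: each output unit sees in_channels × kernel_height × kernel_width inputs. A minimal sketch of the resulting He standard deviation follows, with hypothetical helper names and a 64-channel 3×3 convolution as the example.

```python
import math

def conv_fan_in(in_channels, kernel_h, kernel_w):
    """Number of inputs feeding one output unit of a 2-D convolution."""
    return in_channels * kernel_h * kernel_w

def he_std(fan_in):
    """Standard deviation for He normal: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

fan_in = conv_fan_in(in_channels=64, kernel_h=3, kernel_w=3)   # 64 * 3 * 3 = 576
print(fan_in, round(he_std(fan_in), 4))  # 576 0.0589
```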

📐 Xavier Derivation

1. Linear layer: y_j = Σ_i W_ji × x_i (sum over n_in inputs).
2. Assume the inputs x_i are i.i.d. with zero mean and Var(x) = 1, and the weights W_ji are i.i.d. with zero mean and independent of x.
3. Then Var(y) = n_in × Var(W) × Var(x) = n_in × Var(W) (variance of a sum of independent zero-mean products).
4. Want Var(y) = 1 (same as input) → set Var(W) = 1/n_in.
5. Backward pass gives the same constraint but with n_out → Var(W) = 1/n_out.
6. Compromise: use the average of n_in and n_out in place of either → Var(W) = 2/(n_in + n_out). Done!
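The key step, Var(y) = n_in × Var(W), is easy to check by simulation. The sketch below draws fresh zero-mean Gaussian weights and unit-variance inputs for each sample and measures the variance of the weighted sum. The function name, the sample count, and n_in = 100 are arbitrary choices for the example.

```python
import random
import statistics

def output_variance(n_in, weight_var, n_samples=2000, seed=0):
    """Empirical Var(y) for y = sum_i W_i * x_i with fresh W and x per sample."""
    rng = random.Random(seed)
    w_std = weight_var ** 0.5
    ys = []
    for _ in range(n_samples):
        ys.append(sum(rng.gauss(0.0, w_std) * rng.gauss(0.0, 1.0) for _ in range(n_in)))
    return statistics.pvariance(ys)

n_in = 100
v_matched = output_variance(n_in, weight_var=1.0 / n_in)   # expect roughly 1
v_large = output_variance(n_in, weight_var=4.0 / n_in)     # expect roughly 4
print(v_matched, v_large)
```

Setting Var(W) = 1/n_in keeps the output variance near the input variance of 1, while a weight variance four times larger gives an output variance about four times larger.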

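📐 He Derivation: just one extra line

1. Same setup: Var(z) = n_in × Var(W) × E[x²] for the pre-activation z, where x is the previous layer's output and the weights have zero mean.
2. After ReLU: for z symmetric about zero, half the values become zero → E[ReLU(z)²] = ½ × Var(z). (It is the second moment that halves; the centered variance of ReLU(z) is a little smaller.)
3. So from one layer to the next, Var(z_next) = ½ × n_in × Var(W) × Var(z). Want the multiplier to equal 1.
4. Solve → Var(W) = 2/n_in. The factor of 2 compensates for ReLU's 50% kill rate.
5. Uses n_in (fan-in) to hold the forward signal steady. Using n_out instead holds the backward gradients steady, and He et al. report that either choice alone is sufficient.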
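The "half the signal" step can be checked numerically. Strictly, the quantity that halves under ReLU is the second moment E[ReLU(z)²], which is what feeds the next layer's pre-activation variance. The centered variance of ReLU(z) comes out somewhat lower, at ½ − 1/(2π) ≈ 0.34 for unit-variance Gaussian z. A small sketch, assuming Gaussian pre-activations:

```python
import random
import statistics

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(100_000)]      # pre-activations, Var(z) = 1
a = [max(0.0, v) for v in z]                           # ReLU

second_moment = statistics.fmean(v * v for v in a)     # estimates E[ReLU(z)^2], near 0.5
variance = statistics.pvariance(a)                     # estimates Var(ReLU(z)), near 0.34
print(second_moment, variance)
```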

Which Init Should You Use?

In practice, this is the flowchart that matters:

What's your architecture?
  CNN (ResNet etc.) → He Normal, Var(W) = 2/n_in
    100+ layers? → + Fixup / ReZero: init residual scale α=0
  Transformer → Xavier Normal (LayerNorm handles the rest)
    100+ layers? → + Depth Scaling: W_out /= √(2·num_layers)
  RNN / LSTM → Orthogonal + Xavier, forget bias = 1

Modern Recipes at a Glance

Architecture  |  Initialization  |  Why
CNNs (ResNet)  |  He Normal  |  ReLU activations need the 2× factor
Transformers (GPT, BERT)  |  Xavier Normal  |  LayerNorm stabilizes; GELU ≈ linear near 0
ViT  |  Trunc Normal σ=0.02  |  Stabilizes patch embeddings
RNN / LSTM  |  Orthogonal + Xavier  |  Orthogonal prevents exploding through time
GAN Generator  |  N(0, 0.02)  |  Stabilizes fragile adversarial training
GAN Discriminator  |  He Normal  |  Leaky ReLU typical
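The RNN row relies on one property of orthogonal matrices: multiplying by them preserves vector length, so a hidden state passed through the same recurrent matrix many times neither grows nor shrinks. The sketch below builds an orthogonal matrix with Gram-Schmidt (an assumption made for readability; libraries typically use a QR decomposition) and applies it 100 times. Real RNNs also apply a nonlinearity at each step, which this toy loop leaves out.

```python
import math
import random

def random_orthogonal(n, seed=0):
    """Build an n x n orthogonal matrix by Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:                                   # remove components along earlier rows
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        length = math.sqrt(sum(a * a for a in v))
        rows.append([a / length for a in v])
    return rows

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def norm(x):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in x))

Q = random_orthogonal(16)
h = [1.0] * 16                # starting norm is 4.0
for _ in range(100):          # 100 recurrent steps with no nonlinearity
    h = matvec(Q, h)
final_norm = norm(h)
print(final_norm)             # stays close to 4.0
```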
Transformer layer-by-layer:

Embeddings → N(0, 0.02)  |  Q/K/V, FFN → Xavier  |  LayerNorm → γ=1, β=0  |  Output proj (deep) → scale by 1/√(2·num_layers)
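The reason for scaling output projections by depth shows up in a toy variance count. Assume each transformer block adds two independent branch outputs (attention and FFN) to the residual stream, so a model with num_layers blocks makes 2·num_layers additions. The sketch below uses the 1/√(2·num_layers) form. The function name and the unit branch variance are illustrative assumptions.

```python
import math

def residual_stream_variance(num_layers, branch_var, scale_output=False):
    """Variance of the residual stream after num_layers transformer blocks.

    Toy model: each block adds two independent branch outputs, each with
    variance branch_var, to a stream that starts at variance 1.
    """
    scale = 1.0 / math.sqrt(2 * num_layers) if scale_output else 1.0
    var = 1.0
    for _ in range(2 * num_layers):          # two residual additions per block
        var += (scale ** 2) * branch_var
    return var

for n in (12, 48, 96):
    print(n, residual_stream_variance(n, 1.0), residual_stream_variance(n, 1.0, scale_output=True))
```

Without scaling, the stream variance grows linearly with depth (1 + 2·num_layers in this toy model). With scaling, the total added variance stays at about 1 regardless of depth.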

Don't Forget These

⚖️

Biases

Hidden layers → zero. Output bias for imbalanced classes → log(p / (1-p)) for faster convergence.
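The output-bias formula is the inverse of the sigmoid, so the untrained network starts by predicting the class prior. A minimal sketch, assuming a binary classifier with a sigmoid output and a hypothetical 1% positive rate:

```python
import math

def prior_logit(positive_fraction):
    """Output-bias init so that sigmoid(bias) equals the positive-class prior."""
    return math.log(positive_fraction / (1.0 - positive_fraction))

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

p = 0.01                                  # hypothetical: 1% of examples are positive
b = prior_logit(p)
print(round(b, 3), round(sigmoid(b), 4))  # -4.595 0.01
```

Starting from this bias, the first gradient steps no longer have to be spent learning that positives are rare.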

🔄

Norm Layers

γ=1, β=0 → starts as identity. Reduces sensitivity to weight init in earlier layers.
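To see why a norm layer reduces sensitivity to upstream weight scale, the sketch below applies a hand-written layer norm with γ=1 and β=0 to the same vector at two very different scales. The function and the scales 0.1 and 10 are illustrative. Framework implementations operate on tensors and learn γ and β during training.

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a vector to zero mean and unit variance, then apply gamma and beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]

base = (1.0, -2.0, 3.0, 0.5)
small = [0.1 * v for v in base]     # as if the upstream weights were too small
large = [10.0 * v for v in base]    # as if the upstream weights were too large
out_small, out_large = layer_norm(small), layer_norm(large)
max_gap = max(abs(a - b) for a, b in zip(out_small, out_large))
print(max_gap)                      # tiny: the input scale has been normalized away
```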

🧊

Transfer Learning

Keep pretrained weights. Only init new head (Xavier/He). Use tiny LR: 1e-5 to 1e-6.

🏗️

ReZero / Fixup

For 100+ layer nets: init residual branch scale α=0 so block starts as identity: y = x + 0·F(x).
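The identity-at-init property can be shown in a few lines. In the sketch below, `wild_branch` is a made-up stand-in for an untrained residual branch with badly scaled outputs. With α=0 the stack of blocks passes its input through unchanged, however deep it is.

```python
def residual_block(x, branch, alpha):
    """y = x + alpha * F(x), with F given by `branch`."""
    fx = branch(x)
    return [xi + alpha * fi for xi, fi in zip(x, fx)]

def wild_branch(x):
    # Stand-in for an untrained residual branch with badly scaled outputs.
    return [1000.0 * xi * xi - 7.0 for xi in x]

x = [0.5, -1.0, 2.0]
y = x
for _ in range(200):                      # 200 stacked blocks
    y = residual_block(y, wild_branch, alpha=0.0)
print(y == x)  # True: the whole stack starts as the identity
```

During training α is a learned parameter, so each block can gradually switch its branch on.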

PyTorch Cheat Sheet

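PyTorch's defaults for nn.Linear and nn.Conv2d call kaiming_uniform_ with a=√5, which works out to weights drawn from U(−1/√fan_in, 1/√fan_in): a variance of 1/(3·fan_in), smaller than He's 2/fan_in. That is usually good enough to train, so you mostly override it when you want true He init, or for transformers, GANs, or LSTMs.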

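import torch
import torch.nn as nn

layer = nn.Linear(256, 256)  # example layer; substitute your own module

# He Normal (CNNs with ReLU)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')

# Xavier Normal (Transformers)
nn.init.xavier_normal_(layer.weight)

# Embeddings (GPT-style)
nn.init.normal_(layer.weight, mean=0.0, std=0.02)

# Truncated Normal (ViT)
nn.init.trunc_normal_(layer.weight, std=0.02)

# Orthogonal (RNN recurrent weights)
nn.init.orthogonal_(layer.weight)

# LSTM forget gate bias → 1 (keep memory early on)
# PyTorch packs the gate biases as [input, forget, cell, output].
lstm = nn.LSTM(input_size=256, hidden_size=256)
with torch.no_grad():
    bias = lstm.bias_ih_l0  # bias_hh_l0 has its own forget slice; filling both adds up to about 2
    n = bias.size(0)
    bias[n//4:n//2].fill_(1.0)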

Top Questions & Crisp Answers

Why not initialize all weights to zero?
Symmetry problem: every neuron computes the same thing, gets the same gradient, learns the same feature. The entire layer collapses to a single neuron. You need random diversity to break symmetry.
Xavier vs He — when to use which?
He for any network using ReLU (most CNNs). Xavier for sigmoid/tanh or transformers (where LayerNorm makes the activation choice less critical). One-liner: "ReLU + CNN → He. Transformer → Xavier."
What's the initialization for GPT/BERT?
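Embeddings: N(0, 0.02). Linear layers (Q/K/V, FFN): Xavier Normal is a common choice, though the released GPT-2 and BERT code uses a normal (truncated, for BERT) with std 0.02 for these as well. LayerNorm: γ=1, β=0. For very deep models (GPT-3): scale output projections by 1/√(2·num_layers).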
How does batch norm affect initialization?
It normalizes activations, making training less sensitive to init. Still use He/Xavier for best results — but batch norm gives you more margin for error.
When does init matter most?
Very deep nets (50+ layers) without normalization, GANs (affects stability), and networks without skip connections. It matters least in transfer learning and shallow nets.